s-grams: Defining generalized n-grams for information retrieval
Abstract
n-grams have been used widely and successfully for approximate string matching in many areas. s-grams have been introduced recently as an n-gram based matching technique in which di-grams are formed from both adjacent and non-adjacent characters. s-grams have proved successful in approximate string matching across language boundaries in Information Retrieval (IR). However, s-grams lack precise definitions, as does their similarity comparison. In this paper, we give precise definitions for both. Our definitions are developed in a bottom-up manner, assuming only character strings and elementary mathematical concepts. Extending established practices, we provide novel definitions of s-gram profiles and the L1 distance metric for them. This is a stronger string proximity measure than the popular Jaccard similarity measure because Jaccard is insensitive to the counts of each n-gram in the strings to be compared. However, due to the popularity of Jaccard in IR experiments, we define the reduction of s-gram profiles to binary profiles in order to precisely define the (extended) Jaccard similarity function for s-grams. We also show that n-gram similarity/distance computations are special cases of our generalized definitions. © 2006 Elsevier Ltd. All rights reserved.
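The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's formal definition: the function names, the absence of string padding, and the choice to key profile entries by `(skip, gram)` are all assumptions made for illustration only.

```python
from collections import Counter


def s_grams(text, skip=0):
    """Di-grams of `text` whose two characters have `skip` characters between them.

    skip=0 yields ordinary adjacent bigrams. No padding is applied (an assumption).
    """
    gap = skip + 1
    return [text[i] + text[i + gap] for i in range(len(text) - gap)]


def profile(text, skips=(0, 1)):
    """Count each s-gram of `text`, keyed by (skip, gram) (hypothetical layout)."""
    counts = Counter()
    for k in skips:
        counts.update((k, g) for g in s_grams(text, k))
    return counts


def l1_distance(p, q):
    """Sum of absolute count differences over all keys seen in either profile."""
    return sum(abs(p[key] - q[key]) for key in p.keys() | q.keys())


def jaccard(p, q):
    """Jaccard similarity of the binary (presence-only) versions of two profiles."""
    a, b = set(p), set(q)
    if not a and not b:
        return 1.0  # convention chosen here for two empty profiles
    return len(a & b) / len(a | b)
```

For example, `profile("aaa", skips=(0,))` and `profile("aa", skips=(0,))` contain the same single gram with counts 2 and 1. Their L1 distance is 1 while their Jaccard similarity is 1.0, which illustrates the abstract's point that Jaccard ignores how often each gram occurs. With `skips=(0,)` the sketch reduces to plain bigram comparison.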
Related articles
Comparison of s-gram Proximity Measures in Out-of-Vocabulary Word Translation
Classified s-grams have been successfully used in cross-language information retrieval (CLIR) as an approximate string matching technique for translating out-of-vocabulary (OOV) words. For example, s-grams have consistently outperformed other approximate string matching techniques, like edit distance or n-grams. The Jaccard coefficient has traditionally been used as an s-gram based string proxi...
Statistical Phrases in Automated Text Categorization
In this work we investigate the usefulness of n-grams for document indexing in text categorization (TC). We call n-gram a set t_k of n word stems, and we say that t_k occurs in a document d_j when a sequence of words appears in d_j that, after stop word removal and stemming, consists exactly of the n stems in t_k, in some order. Previous research has investigated the use of n-grams (or some varia...
Evaluation of N-grams Conflation Approach in Text-Based Information Retrieval
This paper examines a conflation method based on the N-grams approach and evaluates its performance relative to the results achieved by other techniques such as the Porter algorithm and successor variety stemming. In addition, an alternative way of enhancing the N-grams method, derived from the concept of inverse frequency weighting, is introduced and evaluated. The experimental results gene...
Indexing Using Both N-Grams and Words
Goals: The Johns Hopkins University Applied Physics Laboratory (JHU/APL) is a first-time entrant in the TREC Category A evaluation. The focus of our information retrieval research is on the relative value of and interaction among multiple term types. In particular, we are interested in examining both words and n-grams as indexing terms. The relative values of words and n-grams have been disputed...
Experiments in spoken document retrieval using phoneme n-grams
In spoken document retrieval, speech recognition is applied to a collection to obtain either words or subword units, such as phonemes, that can be matched against queries. We have explored retrieval based on phoneme n-grams. The use of phonemes addresses the out-of-vocabulary problem, while use of n-grams allows approximate matching on inaccurate phoneme transcriptions. Our experiments explored...
Journal title: Inf. Process. Manage.
Volume: 43
Publication date: 2007